Manual model development
You are given a set of data on housing sale prices in King County (near Seattle) between May 2014 and May 2015.
We want you to build an explanatory model for the price of housing in King County, i.e. an interpretable model in which the included variables are statistically justifiable.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ───────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(fastDummies)
Thank you for using fastDummies!
To acknowledge our work, please cite the package:
Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
library(mosaicData)
library(janitor)
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
library(ggfortify)
library(mosaic)
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
The 'mosaic' package masks several functions from core packages in order to add
additional features. The original behavior of these functions should not be affected by this.
Attaching package: ‘mosaic’
The following object is masked from ‘package:Matrix’:
mean
The following objects are masked from ‘package:dplyr’:
count, do, tally
The following object is masked from ‘package:purrr’:
cross
The following object is masked from ‘package:ggplot2’:
stat
The following objects are masked from ‘package:stats’:
binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test, quantile, sd, t.test, var
The following objects are masked from ‘package:base’:
max, mean, min, prod, range, sample, sum
library(skimr)
Attaching package: ‘skimr’
The following object is masked from ‘package:mosaic’:
n_missing
Load in data, explore and decide what to do with variables
house_prices <- read_csv("data/kc_house_data.csv")
Rows: 21613 Columns: 21
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): id
dbl (19): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_...
dttm (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop id, date, sqft_living15, sqft_lot15 and zipcode: treated as uninformative here
house_tidy <- house_prices %>%
  select(-id, -date, -sqft_living15, -sqft_lot15, -zipcode) %>%
  # waterfront is coded 0/1, so convert it to a logical (TRUE/FALSE)
  mutate(waterfront = waterfront > 0) %>%
  # also convert yr_renovated to a logical, to see whether renovation has an effect
  mutate(renovated = yr_renovated > 0) %>%
  select(-yr_renovated) %>%
  # 'view', 'condition' and 'grade' can be seen as ordinal data:
  # they represent non-mathematical ideas (even though they are written as numbers)
  mutate(view = as_factor(view),
         condition = as_factor(condition),
         grade = as_factor(grade))
Check for aliased variables using the alias() function (this takes in a formula object and a data set).
alias(lm(price ~ ., data = house_tidy))
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + lat + long + renovated
Complete :
(Intercept) bedrooms bathrooms sqft_living sqft_lot floors waterfrontTRUE view1 view2 view3 view4 condition2
sqft_basement 0 0 0 1 0 0 0 0 0 0 0 0
condition3 condition4 condition5 grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12
sqft_basement 0 0 0 0 0 0 0 0 0 0 0 0 0
grade13 sqft_above yr_built lat long renovatedTRUE
sqft_basement 0 -1 0 0 0 0
Based on this output, sqft_basement is an exact linear combination of two other predictors: sqft_basement = 1 * sqft_living + (-1) * sqft_above, i.e. sqft_living - sqft_above.
It therefore adds no information of its own, so I remove sqft_basement:
house_tidy_2 <- house_tidy %>%
select(-sqft_basement)
alias(lm(price ~ ., data = house_tidy_2))
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + yr_built +
lat + long + renovated
Success! The output no longer contains a "Complete" section, so no aliased variables remain.
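The kind of dependency that alias() reported can be reproduced on a small made-up data frame. This is only a sketch: the column names mirror the real ones, but the values are simulated and are not taken from kc_house_data.csv.

```r
# Toy data (simulated, NOT the King County data): living = above + basement
set.seed(42)
toy <- data.frame(
  sqft_above    = round(runif(50, 800, 3000)),
  sqft_basement = round(runif(50, 0, 1000))
)
toy$sqft_living <- toy$sqft_above + toy$sqft_basement
toy$price <- 280 * toy$sqft_living + rnorm(50, sd = 50000)

# alias() reports the exact linear dependency among the predictors
toy_alias <- alias(lm(price ~ sqft_living + sqft_above + sqft_basement, data = toy))
toy_alias$Complete

# The same identity can be checked directly, row by row
all(toy$sqft_basement == toy$sqft_living - toy$sqft_above)
```

On the real data, the equivalent row-by-row check would compare sqft_basement with sqft_living - sqft_above in house_tidy.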
Systematically build a regression model containing up to four main effects (remember, a main effect is just a single predictor with a coefficient), testing the regression diagnostics as you go.
* Splitting the dataset into numeric and non-numeric columns might help ggpairs() run in manageable time, although you will need to add either a price or resid column to the non-numeric dataframe in order to see its relationship with the non-numeric predictors.
# Start by checking correlations for the numeric columns only, then the non-numeric ones,
# to keep things manageable!
# Numeric
house_tidy_numeric <- house_tidy_2 %>%
  select(where(is.numeric))
# Non-numeric
house_tidy_nonnum <- house_tidy_2 %>%
  select(!where(is.numeric))
# Add the price column to the non-numeric data
house_tidy_nonnum$price <- house_tidy_2$price
First predictor
First, run ggpairs() on the numeric variables:
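The plotting call itself is not shown in this export. As a numeric cross-check on such a plot, the correlations with price can also be ranked directly. The sketch below uses simulated toy data (the column names are borrowed from the real data, the values are made up), so its numbers say nothing about King County.

```r
# Toy data (simulated): three numeric predictors with different links to price
set.seed(1)
n <- 200
toy <- data.frame(
  sqft_living = rnorm(n, 2000, 800),
  bedrooms    = rnorm(n, 3, 1),
  yr_built    = rnorm(n, 1970, 30)
)
toy$price <- 280 * toy$sqft_living + 20000 * toy$bedrooms + rnorm(n, sd = 150000)

# Correlation of every predictor with price, sorted by absolute size
predictors <- setdiff(names(toy), "price")
price_cor  <- sapply(toy[predictors], function(x) cor(x, toy$price))
price_cor[order(abs(price_cor), decreasing = TRUE)]
```

The same idea would apply to house_tidy_numeric, with the top entry being the candidate for the first predictor.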
Based on the plot, sqft_living has the highest correlation with price. Let's create a model:
model_1 <- lm(price ~ sqft_living, house_tidy_2)
autoplot(model_1)
summary(model_1)
Call:
lm(formula = price ~ sqft_living, data = house_tidy_2)
Residuals:
Min 1Q Median 3Q Max
-1476062 -147486 -24043 106182 4362067
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43580.743 4402.690 -9.899 <2e-16 ***
sqft_living 280.624 1.936 144.920 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261500 on 21611 degrees of freedom
Multiple R-squared: 0.4929, Adjusted R-squared: 0.4928
F-statistic: 2.1e+04 on 1 and 21611 DF, p-value: < 2.2e-16
# Residual standard error: 261500
# Multiple R-squared: 0.4929
# significant relation
Second predictor
Question: why do we add the residuals and check correlations using that?
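A possible answer, sketched on simulated data: the residuals are the part of price that the current model has not explained, so a predictor that correlates with the residuals carries new information, while a predictor that only correlates with price through a variable already in the model does not. The variables x1, x2, x3 and y below are made up for illustration.

```r
# Toy data (simulated): x2 is related to y only through x1; x3 is a new signal
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
y  <- 3 * x1 + 2 * x3 + rnorm(n)

fit1 <- lm(y ~ x1)
r    <- resid(fit1)

# x2 looks useful against y, but not against the residuals
c(with_y = cor(y, x2), with_resid = cor(r, x2))
# x3 looks even more useful once x1 has been accounted for
c(with_y = cor(y, x3), with_resid = cor(r, x3))
```

Checking the remaining predictors against resid rather than price therefore avoids picking a second predictor that merely repeats what sqft_living already explains.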
library(modelr)
Attaching package: ‘modelr’
The following object is masked from ‘package:mosaic’:
resample
The following object is masked from ‘package:ggformula’:
na.warn
house_tidy_numeric_remaining_resid <- house_tidy_numeric %>%
add_residuals(model_1) %>%
select(-c("price", "sqft_living"))
ggpairs(house_tidy_numeric_remaining_resid)
Based on the plot, lat (latitude) now has the highest correlation with the residuals. Let's add it as a predictor to our model:
model_2 <- lm(price ~ sqft_living + lat, data = house_tidy_2)
autoplot(model_2)
summary(model_2)
Call:
lm(formula = price ~ sqft_living + lat, data = house_tidy_2)
Residuals:
Min 1Q Median 3Q Max
-1487994 -125643 -20309 84613 4368717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.416e+07 5.653e+05 -60.44 <2e-16 ***
sqft_living 2.749e+02 1.794e+00 153.27 <2e-16 ***
lat 7.177e+05 1.189e+04 60.36 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241900 on 21610 degrees of freedom
Multiple R-squared: 0.566, Adjusted R-squared: 0.566
F-statistic: 1.409e+04 on 2 and 21610 DF, p-value: < 2.2e-16
Improved in comparison to model_1 (R-squared rises from 0.4929 to 0.566 and the residual standard error falls from 261500 to 241900), so let's continue and add another predictor.
Based on the plot, all remaining correlations are pretty weak, so I want to check the non-numeric dataset now.
Based on this plot, it looks like grade has an effect on the price: the boxplots for the different grade levels are clearly separated. So let's add grade to our model as a predictor.
head(model.matrix(model_2), 10)
(Intercept) sqft_living lat
1 1 1180 47.5112
2 1 2570 47.7210
3 1 770 47.7379
4 1 1960 47.5208
5 1 1680 47.6168
6 1 5420 47.6561
7 1 1715 47.3097
8 1 1060 47.4095
9 1 1780 47.5123
10 1 1890 47.3684
Let’s also check whether adding the categorical predictor ‘grade’ is statistically justified:
model_3 <- lm(price ~ sqft_living + lat + grade, data = house_tidy_2)
anova(model_2, model_3)
Analysis of Variance Table
Model 1: price ~ sqft_living + lat
Model 2: price ~ sqft_living + lat + grade
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21610 1.2641e+15
2 21599 1.0363e+15 11 2.2788e+14 431.79 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Improved in comparison to model_2, and adding “grade” is statistically justified (F = 431.79 on 11 and 21599 degrees of freedom, p < 2.2e-16), so let’s continue and add another predictor.
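Why anova() rather than the individual t-tests: grade enters the model as a set of dummy variables, so the question is whether the whole set improves the fit, not whether each level differs from the reference level. A minimal sketch on simulated data (the size and grade values and the effect sizes are invented for illustration):

```r
# Toy data (simulated): a numeric predictor plus a four-level factor
set.seed(7)
n <- 300
toy <- data.frame(
  size  = rnorm(n, 2000, 500),
  grade = factor(sample(c("6", "7", "8", "9"), n, replace = TRUE))
)
grade_effect <- c("6" = 0, "7" = 50000, "8" = 120000, "9" = 250000)
toy$price <- 200 * toy$size +
  unname(grade_effect[as.character(toy$grade)]) +
  rnorm(n, sd = 60000)

small <- lm(price ~ size, data = toy)
big   <- lm(price ~ size + grade, data = toy)

# One F-test for all grade dummies at once (3 extra coefficients here)
cmp <- anova(small, big)
cmp
```

The Df column counts the extra coefficients (number of factor levels minus one), which is why the comparison of model_2 and model_3 shows Df = 11.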
question: how to deal with the size of this plot?
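One possible way to handle this, offered as a suggestion rather than the intended answer, is to plot the predictors in small groups that each include resid. The helper chunk_columns() below is hypothetical (it is not part of GGally); it only splits the column names.

```r
# Hypothetical helper (not part of GGally): split predictor names into small
# groups and append the column that every plot should include, e.g. "resid"
chunk_columns <- function(cols, keep, size = 4) {
  others <- setdiff(cols, keep)
  groups <- split(others, ceiling(seq_along(others) / size))
  lapply(unname(groups), function(g) c(g, keep))
}

cols <- c("bedrooms", "bathrooms", "sqft_lot", "floors", "waterfront", "view",
          "condition", "sqft_above", "yr_built", "long", "renovated", "resid")
chunks <- chunk_columns(cols, keep = "resid", size = 4)
chunks

# Each group could then be plotted separately, for example:
# for (g in chunks) print(ggpairs(house_tidy_remaining_resid[, g]))
```

With 11 predictors and groups of four, this gives three smaller plots instead of one 12 x 12 grid.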
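house_tidy_remaining_resid <- house_tidy_2 %>% add_residuals(model_3) %>% select(-c(price, sqft_living, lat, grade))
colnames(house_tidy_remaining_resid)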
[1] "bedrooms" "bathrooms" "sqft_lot" "floors" "waterfront" "view" "condition" "sqft_above" "yr_built"
[10] "long" "renovated" "resid"
Difficult to see, but it looks like the predictor yr_built still has a negative correlation with the residuals, i.e. with the part of price not yet explained. Let’s add it to our model:
model_4 <- lm(price ~ sqft_living + lat + grade + yr_built, data = house_tidy_2)
autoplot(model_4)
summary(model_4)
Call:
lm(formula = price ~ sqft_living + lat + grade + yr_built, data = house_tidy_2)
Residuals:
Min 1Q Median 3Q Max
-1493935 -99411 -13537 72530 4481513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.132e+07 5.829e+05 -36.568 < 2e-16 ***
sqft_living 1.634e+02 2.456e+00 66.534 < 2e-16 ***
lat 5.566e+05 1.066e+04 52.198 < 2e-16 ***
grade3 7.742e+04 2.419e+05 0.320 0.748918
grade4 -2.924e+04 2.130e+05 -0.137 0.890852
grade5 -6.008e+04 2.099e+05 -0.286 0.774709
grade6 -4.141e+04 2.095e+05 -0.198 0.843332
grade7 2.096e+04 2.095e+05 0.100 0.920318
grade8 1.164e+05 2.095e+05 0.556 0.578484
grade9 2.505e+05 2.096e+05 1.195 0.232081
grade10 4.368e+05 2.097e+05 2.083 0.037287 *
grade11 7.204e+05 2.100e+05 3.431 0.000602 ***
grade12 1.241e+06 2.110e+05 5.881 4.14e-09 ***
grade13 2.370e+06 2.181e+05 10.865 < 2e-16 ***
yr_built -2.568e+03 5.712e+01 -44.968 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 209500 on 21598 degrees of freedom
Multiple R-squared: 0.6747, Adjusted R-squared: 0.6745
F-statistic: 3200 on 14 and 21598 DF, p-value: < 2.2e-16
Adding yr_built improved the model, and its coefficient is significant (p < 2e-16).
The final regression model containing four main effects is:
price ~ sqft_living + lat + grade + yr_built
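# Add an interaction between grade and yr_built to the four-main-effect model
model_5 <- lm(price ~ sqft_living + lat + grade + yr_built + yr_built:grade, data = house_tidy_2)
autoplot(model_5)
summary(model_5)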
Call:
lm(formula = price ~ sqft_living + lat + grade + yr_built + yr_built:grade,
data = house_tidy_2)
Residuals:
Min 1Q Median 3Q Max
-1642632 -94343 -11736 67718 4170984
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.469e+06 3.511e+06 -2.412 0.01587 *
sqft_living 1.614e+02 2.414e+00 66.866 < 2e-16 ***
lat 5.764e+05 1.048e+04 55.023 < 2e-16 ***
grade3 -8.230e+06 1.256e+07 -0.655 0.51228
grade4 -2.206e+07 5.135e+06 -4.296 1.75e-05 ***
grade5 -1.829e+07 3.725e+06 -4.910 9.19e-07 ***
grade6 -1.760e+07 3.500e+06 -5.028 5.01e-07 ***
grade7 -1.559e+07 3.478e+06 -4.481 7.47e-06 ***
grade8 -1.378e+07 3.479e+06 -3.960 7.53e-05 ***
grade9 -8.745e+06 3.489e+06 -2.507 0.01220 *
grade10 -2.991e+06 3.516e+06 -0.851 0.39490
grade11 -8.168e+05 3.623e+06 -0.225 0.82162
grade12 -1.948e+07 4.041e+06 -4.821 1.44e-06 ***
grade13 2.477e+06 2.147e+05 11.536 < 2e-16 ***
yr_built -9.595e+03 1.767e+03 -5.430 5.69e-08 ***
grade3:yr_built 4.209e+03 6.451e+03 0.653 0.51408
grade4:yr_built 1.127e+04 2.631e+03 4.285 1.84e-05 ***
grade5:yr_built 9.319e+03 1.898e+03 4.909 9.21e-07 ***
grade6:yr_built 8.964e+03 1.780e+03 5.035 4.81e-07 ***
grade7:yr_built 7.951e+03 1.769e+03 4.495 6.99e-06 ***
grade8:yr_built 7.078e+03 1.769e+03 4.001 6.34e-05 ***
grade9:yr_built 4.616e+03 1.774e+03 2.602 0.00928 **
grade10:yr_built 1.820e+03 1.788e+03 1.018 0.30879
grade11:yr_built 8.778e+02 1.841e+03 0.477 0.63342
grade12:yr_built 1.051e+04 2.048e+03 5.131 2.91e-07 ***
grade13:yr_built NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 204900 on 21588 degrees of freedom
Multiple R-squared: 0.6889, Adjusted R-squared: 0.6886
F-statistic: 1992 on 24 and 21588 DF, p-value: < 2.2e-16
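The "(1 not defined because of singularities)" note and the NA row for grade13:yr_built mean that one interaction coefficient could not be estimated. One common cause, which I have not checked against this dataset, is a factor level with too few observations to support its own slope. The sketch below reproduces that situation on simulated data, where the reference level "a" has a single observation; running table(house_tidy_2$grade) would show whether something similar applies here.

```r
# Toy data (simulated): level "a" is the reference level and has one observation
set.seed(3)
toy <- data.frame(
  g = factor(c("a", rep("b", 20), rep("c", 20))),
  x = runif(41, 0, 100)
)
toy$y <- 10 * toy$x + rnorm(41, sd = 50)

fit <- lm(y ~ g + x + g:x, data = toy)
coef(fit)     # the last interaction coefficient comes back as NA
table(toy$g)  # shows the level with a single observation
```

A single point cannot pin down both an intercept and a slope for its level, so the design matrix loses one rank and lm() drops the last dependent column.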
library(relaimpo)
calc.relimp(model_4, type = "lmg", rela = TRUE)
Response variable: price
Total response variance: 134782378397
Analysis based on 21613 observations
14 Regressors:
Some regressors combined in groups:
Group grade : grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 grade13
Relative importance of 4 (groups of) regressors assessed:
grade sqft_living lat yr_built
Proportion of variance explained by model: 67.47%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
grade 0.45234793
sqft_living 0.41292696
lat 0.09809759
yr_built 0.03662753
Average coefficients for different model sizes:
1group 2groups 3groups 4groups
sqft_living 280.6236 245.6026 205.1658 163.3972
lat 813411.5832 734705.1674 614095.2077 556557.5636
grade3 63666.6672 69680.5509 73701.3572 77416.5146
grade4 72381.0351 34514.9701 86.4573 -29235.6859
grade5 106523.9716 46648.4183 -9624.3207 -60079.1942
grade6 159919.6380 89701.2340 21910.3579 -41409.2380
grade7 260590.2629 183125.4894 102723.8679 20956.8440
grade8 400852.7662 313128.1313 217874.3072 116416.1783
grade9 631513.1864 514630.5870 387044.8629 250482.0434
grade10 929771.0746 776778.8055 611642.6363 436768.8912
grade11 1354841.7274 1156878.2824 944314.7067 720448.1332
grade12 2049222.0006 1795614.3859 1524679.6350 1240983.8377
grade13 3567615.3852 3183496.6514 2781641.5453 2369661.2778
yr_built 675.0698 -1454.6530 -2561.6237 -2568.4167
So, by this measure, grade is the most important predictor (accounting for 45% of the explained variance, i.e. of R²), followed by sqft_living (41%), then lat (10%), and finally yr_built (4%).
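For intuition about what the lmg metric measures: it averages each predictor's gain in R² over all possible orders of entering the predictors into the model. With only two predictors this can be done by hand. The sketch below uses simulated variables (x1, x2 and y are made up) and is not a replacement for calc.relimp().

```r
# Toy data (simulated): two correlated predictors
set.seed(11)
n  <- 400
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 2 * x1 + 1 * x2 + rnorm(n)

r2 <- function(formula) summary(lm(formula))$r.squared

r2_x1   <- r2(y ~ x1)
r2_x2   <- r2(y ~ x2)
r2_full <- r2(y ~ x1 + x2)

# lmg: average each predictor's R-squared gain over both orders of entry
lmg_x1 <- mean(c(r2_x1, r2_full - r2_x2))
lmg_x2 <- mean(c(r2_x2, r2_full - r2_x1))

# Shares of the full R-squared, which sum to 1 (as with rela = TRUE)
c(x1 = lmg_x1, x2 = lmg_x2) / r2_full
```

The two lmg values always add up to the full model's R², which is why the normalized shares in the output above sum to 100%.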